5 research outputs found
A Sweet Recipe for Consolidated Vulnerabilities: Attacking a Live Website by Harnessing a Killer Combination of Vulnerabilities
The recent emergence of new vulnerabilities is a pressing problem in the
complex world of website security. Most websites fail to keep up to date
against these new vulnerabilities, and their operators do not realize the
resulting weaknesses. As a result, when cyber-criminals scan such outdated,
vulnerable websites, the scanner reports a set of vulnerabilities. Once found,
these vulnerabilities are exploited to steal data, distribute malicious
content, or inject defacement and spam content into the vulnerable websites.
Furthermore, a combination of different vulnerabilities can cause more damage
than anticipated.
Therefore, in this paper, we endeavor to find connections among various
vulnerabilities such as cross-site scripting, local file inclusion, remote file
inclusion, buffer overflow, CSRF, etc. To do so, we develop a Finite State
Machine (FSM) attacking model, which analyzes a set of vulnerabilities to
find connections among them. We demonstrate the efficacy of our model by
applying it to the set of vulnerabilities found on two live websites.
Comment: Accepted at 5th International Conference on Networking, Systems and
Security (5th NSysS 2018)
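The idea of chaining vulnerabilities through a finite state machine can be sketched as follows. This is only an illustration under stated assumptions, not the model from the paper: the state names, the transition table, and the function `find_chains` are all hypothetical. States stand for attacker footholds, each transition is labelled with the vulnerability that enables it, and a breadth-first search lists the vulnerability sequences that lead from a start state to a goal state.

```python
from collections import deque

# Hypothetical transition table: (state, vulnerability) -> next state.
# These states and edges are illustrative assumptions, not taken from the paper.
TRANSITIONS = {
    ("anonymous", "xss"): "session_hijacked",
    ("anonymous", "lfi"): "file_read",
    ("session_hijacked", "csrf"): "admin_action",
    ("file_read", "rfi"): "code_execution",
    ("admin_action", "rfi"): "code_execution",
}


def find_chains(start, goal, found):
    """Return every loop-free vulnerability sequence from start to goal.

    Only vulnerabilities listed in `found` (e.g. scanner results) may be used.
    """
    chains = []
    # Each queue entry: (current state, vulnerabilities used so far, visited states).
    queue = deque([(start, [], {start})])
    while queue:
        state, path, seen = queue.popleft()
        if state == goal:
            chains.append(path)
            continue
        for (src, vuln), dst in TRANSITIONS.items():
            # Follow a transition only if its vulnerability was found
            # and the target state has not been visited on this path.
            if src == state and vuln in found and dst not in seen:
                queue.append((dst, path + [vuln], seen | {dst}))
    return chains


if __name__ == "__main__":
    scanner_results = {"xss", "lfi", "rfi", "csrf"}
    for chain in find_chains("anonymous", "code_execution", scanner_results):
        print(" -> ".join(chain))
```

In this toy table, no single vulnerability reaches the goal state on its own; only combinations do, which is the kind of connection the abstract describes.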
bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents
Despite the existence of numerous Optical Character Recognition (OCR) tools,
the lack of comprehensive open-source systems hampers the progress of document
digitization in various low-resource languages, including Bengali. Low-resource
languages, especially those with an alphasyllabary writing system, suffer from
the lack of large-scale datasets for various document OCR components such as
word-level OCR, document layout extraction, and distortion correction, which
are available as individual modules in high-resource languages. In this paper,
we introduce BengaliAI-BRACU-OCR (bbOCR): an open-source scalable document
OCR system that can reconstruct Bengali documents into a structured, searchable,
digitized format, leveraging a novel Bengali text recognition model and two
novel synthetic datasets. We present extensive component-level and system-level
evaluation: both use a novel diversified evaluation dataset and comprehensive
evaluation metrics. Our extensive evaluation suggests that our proposed
solution is preferable to the current state-of-the-art Bengali OCR systems.
The source code and datasets are available here:
https://bengaliai.github.io/bbocr
BaDLAD: A Large Multi-Domain Bengali Document Layout Analysis Dataset
While strides have been made in deep learning based Bengali Optical Character
Recognition (OCR) in the past decade, the absence of large Document Layout
Analysis (DLA) datasets has hindered the application of OCR in document
transcription, e.g., transcribing historical documents and newspapers.
Moreover, rule-based DLA systems that are currently being employed in practice
are not robust to domain variations and out-of-distribution layouts. To this
end, we present the first multidomain large Bengali Document Layout Analysis
Dataset: BaDLAD. This dataset contains 33,695 human annotated document samples
from six domains - i) books and magazines, ii) public domain govt. documents,
iii) liberation war documents, iv) newspapers, v) historical newspapers, and
vi) property deeds, with 710K polygon annotations for four unit types:
text-box, paragraph, image, and table. Through preliminary experiments
benchmarking the performance of existing state-of-the-art deep learning
architectures for English DLA, we demonstrate the efficacy of our dataset in
training deep learning based Bengali document digitization models.
OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking
We present OOD-Speech, the first out-of-distribution (OOD) benchmarking
dataset for Bengali automatic speech recognition (ASR). Being one of the most
spoken languages globally, Bengali exhibits large diversity in dialects and
prosodic features, which demands that ASR frameworks be robust to
distribution shifts. For example, Islamic religious sermons in Bengali are
delivered with a tonality that is significantly different from regular speech.
Our training dataset was collected via massive online crowdsourcing campaigns,
which resulted in 1177.94 hours of speech collected and curated from native
Bengali speakers from South Asia. Our test dataset comprises 23.03 hours of
speech collected and manually annotated from 17 different sources, e.g.,
Bengali TV drama, Audiobook, Talk show, Online class, and Islamic sermons to
name a few. OOD-Speech is jointly the largest publicly available speech
dataset, as well as the first out-of-distribution ASR benchmarking dataset for
Bengali.